Putting Data to practical use - Multivariate Analysis of Starbucks drinks in Kids and Others category
In this take-home exercise, the task is to create a data visualization that segments the Starbucks drinks menu in the ‘kids and other’ category by nutritional indicators.
The challenge is to show the numerous attributes that represent the nutritional values in a manner that is understandable and can readily be used to explore the different segments.
The proposed solution is a heat map in which each row represents a drink on the menu and each column a nutritional indicator. Using heatmaply's built-in hierarchical clustering, the drinks are segmented by their nutritional values.
For this task, the following R packages are loaded:
packages = c('tidyverse', 'corrplot', 'seriation', 'heatmaply', 'rmarkdown', 'dendextend')
for (p in packages) {
  # install the package if it is not already available, then attach it
  if (!require(p, character.only = T)) {
    install.packages(p)
  }
  library(p, character.only = T)
}
For this data visualization, a provided data set consisting of the different drink menu items in Starbucks and their nutritional values will be used.
The data is saved in CSV format. The read_csv function from readr (loaded as part of the tidyverse) is used to load the data.
starbucks <- read_csv("data/starbucks_drink.csv")
For this exercise, as we are only concerned with drinks from the kids and other category, this category is filtered into a separate tibble object.
kids_drinks <- starbucks %>% filter(Category == 'kids-drinks-and-other')
Exploring the data, one Vanilla Creme entry seems to be mislabeled, as its values do not correspond with those of the other entries for the same drink. It is the only drink in this category with a 24 oz portion despite being labelled as “Tall” size, and it has a high level of caffeine whereas other similar drinks have 0 caffeine. It is therefore reasonable to remove this row, as shown in the code chunk below, using the filter function on the portion value.
kids_drinks <- kids_drinks %>% filter(`Portion(fl oz)` != 24)
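A mislabeled row like this one could also be located programmatically rather than by eye. The following is a minimal sketch in base R on a toy data frame; the column names `portion` and `caffeine` and all of the values are invented for illustration and are not taken from the Starbucks data set.

```r
# Toy stand-in for kids_drinks; names and values are invented for illustration.
toy <- data.frame(
  Name     = c("Vanilla Creme", "Vanilla Creme", "Vanilla Creme", "Hot Chocolate"),
  Size     = c("Short", "Tall", "Tall", "Tall"),
  portion  = c(8, 12, 24, 12),
  caffeine = c(0, 0, 150, 20),
  stringsAsFactors = FALSE
)

# Most common portion for each size label.
typical <- tapply(toy$portion, toy$Size, function(x) {
  as.numeric(names(which.max(table(x))))
})

# Rows whose portion disagrees with the typical portion for their size label.
mismatch <- as.vector(toy$portion != typical[toy$Size])
suspect <- toy[mismatch, ]
print(suspect)
```

On the real data, the same idea would use the `Size` and `Portion(fl oz)` columns; any flagged row would still need a manual check before being removed.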
The ‘Caffeine’ variable is loaded as a chr (character) variable by default. Below, it is converted to numeric.
kids_drinks$`Caffeine(mg)` <- parse_number(kids_drinks$`Caffeine(mg)`)
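To illustrate what parse_number does, here is a small sketch on a hypothetical character vector. The raw values are made up and only show that the function extracts the first number from each string and drops surrounding text.

```r
library(readr)

# Hypothetical raw values of the kind a character caffeine column might hold.
raw <- c("0", "20", "15 mg")

# parse_number() keeps the first number found in each string.
caffeine <- parse_number(raw)
print(caffeine)
```

A string without any digits would be expected to come back as NA with a parsing warning, so it is worth checking for new NA values after the conversion.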
At this stage, a correlogram is generated to explore how correlated the nutritional indicators are to each other. Before the plot is generated, the variables first need to be converted into a correlation matrix using R's cor function. All numeric columns are selected using column number indexing.
kids_drinks.cor <- cor(kids_drinks[, 3:15])
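Selecting columns by position works here but would silently break if the column order changed. As an alternative, the numeric columns could be picked by type. The sketch below uses base R on a toy data frame with invented values.

```r
# Toy data with one character column and three numeric columns (invented values).
toy <- data.frame(
  name     = c("a", "b", "c", "d"),
  calories = c(90, 210, 250, 420),
  sugars   = c(12, 29, 33, 52),
  protein  = c(0, 9, 4, 12),
  stringsAsFactors = FALSE
)

# Logical flag per column: TRUE where the column is numeric.
is_num <- vapply(toy, is.numeric, logical(1))

# Correlation matrix over the numeric columns only.
toy_cor <- cor(toy[, is_num])
print(round(toy_cor, 2))
```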
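The correlogram is generated with the corrplot.mixed function, with the lower portion of the plot displaying the correlations as ellipses and the upper portion displaying the correlation figures. The other arguments in the code chunk below set the position, colour and size of the variable labels (tl.pos, tl.col, tl.cex), the plot type used on the diagonal (diag) and the size of the correlation figures (number.cex).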
corrplot.mixed(kids_drinks.cor,
upper = "number",
lower = "ellipse",
tl.pos = "lt",
diag = "l",
tl.col = "black",
tl.cex = 0.8,
number.cex = 0.7)
The figure above supports the intuition that portion size is correlated with a few of the nutritional measures, such as Calories, Sodium, Total Carbohydrate and Sugars, as bigger drinks contain more content and therefore have higher values for those markers.
Most of the other variables, such as Calories, are also positively correlated with many other variables, as they may be derived from or composed of one another, for example Total Fat, Saturated Fat and Calories from Fat. Interestingly, Caffeine is highly correlated with Dietary Fiber.
Thus, as a means of reducing the number of rows to be displayed, the data can be filtered to a single drink size. To make sure that the data still contains all the combinations of drinks (name, type of milk and whipped cream) after filtering, these options are concatenated and the number of unique combinations is checked in the code chunk below.
kids_drinks2 <- kids_drinks %>% unite(drink, c("Name", "Milk", "Whipped Cream"))
length(unique(kids_drinks2$drink))
[1] 60
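For readers unfamiliar with unite, the sketch below shows its effect on a three-row toy data frame. The milk and whipped cream labels are invented for illustration; by default, unite joins the values with an underscore and drops the input columns.

```r
library(tidyr)

# Toy data; the Milk and Whipped Cream values are invented for illustration.
toy <- data.frame(
  Name = c("Hot Chocolate", "Hot Chocolate", "Vanilla Creme"),
  Milk = c("Whole", "Almond", "Whole"),
  `Whipped Cream` = c("Whip", "No Whip", "Whip"),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Combine the three columns into a single "drink" key.
toy2 <- unite(toy, drink, c("Name", "Milk", "Whipped Cream"))
print(toy2$drink)

# Number of distinct drink combinations.
n_combos <- length(unique(toy2$drink))
```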
Below, the data is filtered to a specific drink size and the number of unique drink combinations is checked again to ensure that none of the drink combinations is filtered out.
As the number of unique drink combinations is unchanged, the menu can be filtered down to only “Tall” size drinks. The summary statistics below show that, after filtering to one size, all drink combinations have 0 g of trans fat. As trans fat is correlated with other variables, it is assumed that removing it would not affect the segmentation results significantly.
kids_drinks_filter <- kids_drinks2 %>% filter(Size == "Tall")
length(unique(kids_drinks_filter$drink))
[1] 60
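Matching counts are a good sign, but a stricter check would compare the sets of combinations directly. A minimal base R sketch on toy data (drink keys and sizes invented for illustration):

```r
# Toy data: three drink combinations, each offered in two sizes (invented values).
before <- data.frame(
  drink = c("A_Whole", "A_Almond", "B_Whole", "A_Whole", "A_Almond", "B_Whole"),
  Size  = c("Tall", "Tall", "Tall", "Grande", "Grande", "Grande"),
  stringsAsFactors = FALSE
)

# Keep a single size.
after <- before[before$Size == "Tall", ]

# Combinations present before filtering but absent afterwards.
missing_combos <- setdiff(unique(before$drink), unique(after$drink))
print(length(missing_combos))
```

An empty result means every combination is still represented after filtering.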
summary(kids_drinks_filter)
   Category            drink           Portion(fl oz)
 Length:60          Length:60          Min.   : 8.00
 Class :character   Class :character   1st Qu.:12.00
 Mode  :character   Mode  :character   Median :12.00
                                       Mean   :11.93
                                       3rd Qu.:12.00
                                       Max.   :12.00
    Calories     Calories from fat  Total Fat(g)     Saturated fat(g)
 Min.   : 90.0   Min.   :  0.00     Min.   : 0.000   Min.   : 0.000
 1st Qu.:210.0   1st Qu.: 50.00     1st Qu.: 6.000   1st Qu.: 1.500
 Median :250.0   Median : 75.00     Median : 8.500   Median : 5.000
 Mean   :255.7   Mean   : 76.75     Mean   : 8.517   Mean   : 4.767
 3rd Qu.:300.0   3rd Qu.:110.00     3rd Qu.:12.000   3rd Qu.: 7.000
 Max.   :420.0   Max.   :150.00     Max.   :17.000   Max.   :11.000
  Trans fat(g)   Cholesterol(mg)    Sodium(mg)      Total Carbohydrate(g)
 Min.   :0       Min.   : 0.00     Min.   : 15.0    Min.   :14.00
 1st Qu.:0       1st Qu.: 0.00     1st Qu.:120.0    1st Qu.:31.75
 Median :0       Median :25.00     Median :130.0    Median :36.50
 Mean   :0       Mean   :20.25     Mean   :154.9    Mean   :37.83
 3rd Qu.:0       3rd Qu.:30.00     3rd Qu.:182.5    3rd Qu.:42.25
 Max.   :0       Max.   :50.00     Max.   :290.0    Max.   :61.00
 Dietary Fiber(g)   Sugars(g)       Protein(g)     Caffeine(mg)
 Min.   :0.000     Min.   :12.00   Min.   : 0.0   Min.   : 0
 1st Qu.:0.000     1st Qu.:29.00   1st Qu.: 4.0   1st Qu.: 0
 Median :1.000     Median :33.00   Median : 9.0   Median : 0
 Mean   :1.533     Mean   :34.42   Mean   : 7.4   Mean   : 8
 3rd Qu.:3.000     3rd Qu.:39.25   3rd Qu.:10.0   3rd Qu.:20
 Max.   :4.000     Max.   :52.00   Max.   :12.0   Max.   :20
     Size
 Length:60
 Class :character
 Mode  :character
In the next step, columns that do not contribute to the segmentation are removed. The Size and Portion columns are removed as the visualization segments the drinks across one size only. Category is dropped as there is only one category after filtering. Trans fat is dropped as it has a value of zero for all drinks of this size.
kids_drinks_selected <- kids_drinks_filter %>% select(-`Portion(fl oz)`, -`Category`, -`Size`, -`Trans fat(g)`)
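Columns with a single repeated value, such as Trans fat and Category here, carry no information for clustering. Instead of reading them off the summary, they could be detected automatically. A minimal base R sketch with invented column names and values:

```r
# Toy data: two varying columns and two constant ones (invented values).
toy <- data.frame(
  calories  = c(90, 250, 420),
  trans_fat = c(0, 0, 0),
  category  = c("kids", "kids", "kids"),
  sugars    = c(12, 33, 52),
  stringsAsFactors = FALSE
)

# TRUE for columns that contain only one distinct value.
is_constant <- vapply(toy, function(x) length(unique(x)) == 1, logical(1))
print(names(toy)[is_constant])

# Keep only the columns that vary across rows.
toy_varying <- toy[, !is_constant, drop = FALSE]
```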
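In the step below, the drink combinations are set as the row names and the numeric columns are converted to a data matrix.
# a plain data frame is used because setting row names on a tibble is deprecated
kids_drinks_selected <- as.data.frame(kids_drinks_selected)
row.names(kids_drinks_selected) <- kids_drinks_selected$drink
# drop the drink column (column 1) so that only the numeric indicators remain
kd_matrix <- data.matrix(kids_drinks_selected[, -1])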
To segment the different drinks, hierarchical clustering is used together with the heat map. As the attributes used for clustering are numerical, the default option of computing the dissimilarity between rows is used, with Euclidean distance as the dissimilarity measure. The values are scaled using the percentize method, as it allows an intuitive comparison of the values in each column. In the code chunk below, a statistical method is used to determine an appropriate hierarchical clustering method.
  dist_methods hclust_methods     optim
1      unknown         ward.D 0.5590574
2      unknown        ward.D2 0.5659402
3      unknown         single 0.4100543
4      unknown       complete 0.5693458
5      unknown        average 0.6239010
6      unknown       mcquitty 0.6010018
7      unknown         median 0.2481516
8      unknown       centroid 0.3868984
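The table above has the shape of the performance table returned by dendextend's dend_expend function when it is given a distance object. The sketch below shows how such a table could be produced. It is an assumption about the general approach, not the original chunk, and it uses the built-in mtcars data as a stand-in for kd_matrix.

```r
library(dendextend)
library(heatmaply)

# mtcars is only a stand-in; the article's own input would be kd_matrix.
scaled <- percentize(mtcars)

# Euclidean distance between rows of the scaled data.
d <- dist(scaled, method = "euclidean")

# Try each hclust method and score it against the original distances.
perf <- dend_expend(d)$performance
print(perf)

# Method with the highest optim score.
best <- as.character(perf$hclust_methods[which.max(perf$optim)])
print(best)
```

By default the optim column is the correlation between the cophenetic distances of each tree and the original distances, so a higher value indicates a tree that preserves the original dissimilarities better.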
Based on the output above, the average clustering method gives the best result, as it has the highest optim value (0.6239). In the code chunk below, the optimal number of clusters is determined using ‘average’ as the clustering method.
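One way to estimate the number of clusters is dendextend's find_k function, which relies on the fpc package. The sketch below is an assumption about how this step could be done, again with mtcars standing in for kd_matrix, so the resulting k says nothing about the Starbucks data.

```r
library(dendextend)
library(heatmaply)

# mtcars is only a stand-in; the article's own input would be kd_matrix.
scaled <- percentize(mtcars)
d <- dist(scaled, method = "euclidean")

# Hierarchical clustering with the 'average' linkage chosen earlier.
hc <- hclust(d, method = "average")

# Search over candidate numbers of clusters (requires the fpc package).
num_k <- find_k(hc)
print(num_k$k)
```

Calling plot(num_k) afterwards shows the score for each candidate k, which helps to judge how clear-cut the suggested value is.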
Combining the results obtained above, the code chunk below is used to generate the heatmap, with additional arguments that improve the clarity of the visualization.
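A minimal sketch of such a heatmaply call is given here. It is not the original chunk: mtcars stands in for kd_matrix, the value of k_row is a placeholder because the cluster count found earlier is not shown, and the title, axis labels and font sizes are illustrative choices.

```r
library(heatmaply)

# mtcars is only a stand-in; the article's own input would be kd_matrix.
p <- heatmaply(
  percentize(mtcars),        # percentile scaling within each column
  dist_method = "euclidean", # dissimilarity between rows
  hclust_method = "average", # linkage chosen from the comparison table
  Colv = NA,                 # keep the columns in their original order
  k_row = 3,                 # placeholder number of row clusters
  seriate = "OLO",           # optimal leaf ordering of the dendrogram
  fontsize_row = 6,
  fontsize_col = 8,
  main = "Illustrative clustered heat map",
  xlab = "Indicators",
  ylab = "Items"
)
```

Printing p in an interactive session or an R Markdown document displays the interactive heat map.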
In the resultant plot, drinks are arranged by combinations of their nutritional markers. Drinks in the higher range of all attributes are positioned towards the top of the chart and correspond to relatively more unhealthy drinks.
Reviewing the segmented drinks menu from the top row to the bottom, the relatively most unhealthy drink is “Salted Caramel Hot Chocolate” with whipped cream, which is in a high percentile for the sugar, carbohydrate and sodium indicators. Varying the type of milk can increase the amount of protein in the drink but does little to reduce the unhealthy markers. Drinks with coconut milk have less protein than drinks with whole milk.
Drinks with whipped cream are mostly arranged in the top half of the plot due to their higher Calories and total fat values, suggesting that, regardless of the base drink, the addition of whipped cream contributes significantly to increasing the unhealthy nutritional indicators.
Hot Chocolate, Pumpkin Spice Hot Chocolate and Salted Caramel Hot Chocolate are among the most caffeinated drinks on the menu, which parents might wish to take note of to avoid accidentally keeping kids awake at night.
For healthier options, one should look towards Cinnamon Dolce Creme, Vanilla Creme or Pumpkin Spice Creme with no whipped cream and with either almond or non-fat milk, which are located towards the bottom of the plot with lower amounts of calories. The option with the lowest calories and fat content is Steamed Apple Juice; however, it still has relatively high Sugars and Carbohydrates.